Why R for genetics/genomics?

- R can anyhow be used to analyse non genetic data
- dedicated packages exist: adegenet, pegas, poppr, and hierfstat
- R Markdown can be used to produce a report of the analyses, incorporating the workflow and results

This document is an R Markdown document. An R Markdown document is the combination of a Markdown document with chunks of R code in the middle that are evaluated using an R package called knitr.
R Markdown documents can be used interactively, or they can be turned into a beautiful HTML document (or other formats, e.g. PDF, Word, …) by clicking the “Knit” button at the top.
The benefit of using an R Markdown document is that you have your text, scripts, results, plots and tables in a single place.
You can edit the current R Markdown file to add your own notes to it (but save it under a different name, so that you can go back to the original version if you need to) and to try out or even modify the R code.
If you prefer to see pure R code only, just do:
# file.remove("R4G_course.R") # uncomment and run to remove existing file
knitr::purl("R4G_course.Rmd")
Note: it will create a file called R4G_course.R but it won’t overwrite it if it already exists. So, if you want some changes to be taken into account, make sure you delete the R file beforehand (e.g., by running the first line in the chunk above).
Just write plain text in the script. You can use different numbers of stars to indicate:

*italic* -> italic
**bold** -> bold
***both*** -> both

knitr syntax

Use ```{r} to start an R chunk and ``` to close it.
For example if you type 1 + 1 within a chunk, it will be displayed as:
1 + 1
## [1] 2
Then, to evaluate the chunk, just press CTRL-R after placing your cursor anywhere inside the chunk.
You can learn more about R Markdown in this online book, or, if you already know a little, the RStudio cheatsheet will help you keep it all in mind.
Getting the R packages ready

We will use the R packages adegenet, pegas, poppr, and hierfstat for the analyses, and ggplot2, lattice and viridisLite for the plots. If you have already installed them before, you can simply load the packages as follows:
library(adegenet)
library(pegas)
library(poppr)
library(hierfstat)
library(ggplot2)
library(lattice)
library(viridisLite)
Note 1: If one of the packages is missing, then you must install it BEFORE loading it. For example, to install adegenet do:
install.packages("adegenet")
Note 2: To be able to “knit” the R Markdown document, you will also need a few other R packages; just try knitting and RStudio should handle the rest for you.
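If you want to check for and install everything in one go, here is a minimal sketch (the package names are taken from the list above):

```r
# Identify any course package that is not yet installed (sketch)
pkgs <- c("adegenet", "pegas", "poppr", "hierfstat",
          "ggplot2", "lattice", "viridisLite")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
missing_pkgs # if non-empty, run: install.packages(missing_pkgs)
```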
We also add a few extra functions we coded for you:
source("scripts/tools.R")
## [1] "All functions succesfully loaded"
genepop format

We will be using a very common type of input file for microsatellite data, which was originally developed for a stand-alone program called Genepop. You will learn how to create a genepop file with Jörns tomorrow. Today we will use one that we have made for you.
Here is the basic structure:
comment line
locus-1, locus-2, locus-3, ...
pop
Ind-1 , 100102 135135 204208 ...
Ind-2 , 100100 131139 200208 ...
Ind-3 , 102102 131139 200204 ...
Ind-4 , 000000 135139 208208 ...
...
the first line is a “comment line” - you can keep notes on your project here. E.g. “24 animals, 8 loci”
the second line contains the names of the loci (locus-1, …).
the third line has a “pop flag” that indicates the next samples belong to the same population.
below this (until the next pop flag), every line contains the multi-locus genotype of one individual belonging to that population
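To make this structure concrete, here is a sketch that writes a tiny, made-up genepop file following the layout above (the file name toy.gen and the genotypes are ours, not part of the course data):

```r
# Write a minimal genepop file: comment line, locus names, pop flag,
# then one genotype line per individual ("000000" encodes missing data)
genepop_lines <- c(
  "2 animals, 2 loci",
  "locus-1, locus-2",
  "pop",
  "Ind-1 , 100102 135135",
  "Ind-2 , 000000 131139"
)
writeLines(genepop_lines, "toy.gen")
```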
Is everyone comfortable with the terms locus, allele and genotype?
In the folder data in this RStudio project you can find the genepop file called microbov_small.gen.
We will create an R object that contains the data in microbov_small.gen using the read.genepop() function of adegenet. Here we need to provide the path of the file as "data/microbov_small.gen" using file =, and some information about how the microsatellite data are encoded in the file using ncode =. (Here we also set the argument quiet = TRUE, but you don’t have to; this just shortens the document by not displaying some messages.)
myData <- read.genepop(file = "data/microbov_small.gen", ncode = 3, quiet = TRUE)
Why did we use ncode = 3?
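A hint: ncode gives the number of characters used to encode a single allele. With ncode = 3, a six-character diploid genotype such as "100102" splits into two three-character alleles:

```r
# Split a 6-character diploid genotype into its two 3-digit alleles
genotype <- "100102"
allele_pair <- c(substr(genotype, 1, 3), substr(genotype, 4, 6))
allele_pair
```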
Now the data is read into R.
If there had been a problem with the data file (e.g. incorrectly formatted), then we would have got an error message. Let’s try:
badFormat <- read.genepop(file = "data/badFormatting.gen", ncode = 3, quiet = TRUE)
## Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent
We “broke” this input file by including fewer locus names than data. The error message is telling us that the length of dimnames [1] (= number of locus names) does not correspond to the number of columns containing the data.
As you can see, R provides us a clue about what the problem is.
Another way we can generate an error is by providing wrong arguments to the function, despite having a properly formatted file:
badInput <- read.genepop(file = "data/microbov_small.gen", ncode = 2, quiet = TRUE)
## Error in read.genepop(file = "data/microbov_small.gen", ncode = 2, quiet = TRUE): some alleles are not encoded with 2 characters
## Check 'ncode' argument
Unsurprisingly, there will be an error if you type the path to the file incorrectly.
One way to limit such issues is to list all genepop files in the folder containing the data (to make sure that the file is where you are trying to read it):
dir(path = "data/", pattern = "*.gen")
## [1] "badFormatting.gen" "lynx.gen" "microbov_small.gen" "nancycats.gen"
R representations of your data

genind format

We haven’t looked at our data yet!
myData
## /// GENIND OBJECT /////////
##
## // 160 individuals; 15 loci; 143 alleles; size: 130.5 Kb
##
## // Basic content
## @tab: 160 x 143 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 5-14)
## @loc.fac: locus factor for the 143 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: read.genepop(file = "data/microbov_small.gen", ncode = 3, quiet = TRUE)
##
## // Optional content
## @pop: population of each individual (group size range: 15-31)
Geeky note: For more details on the content of the genind object, just try:
str(myData)
## Formal class 'genind' [package "adegenet"] with 11 slots
## ..@ tab : int [1:160, 1:143] 2 1 1 2 1 1 0 2 1 2 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:160] "AFBIBOR9503" "AFBIBOR9504" "AFBIBOR9505" "AFBIBOR9506" ...
## .. .. ..$ : chr [1:143] "INRA63.183" "INRA63.181" "INRA63.177" "INRA63.175" ...
## ..@ loc.fac : Factor w/ 15 levels "INRA63","INRA5",..: 1 1 1 1 1 1 1 2 2 2 ...
## ..@ loc.n.all: Named int [1:15] 7 5 12 5 10 9 7 7 12 9 ...
## .. ..- attr(*, "names")= chr [1:15] "INRA63" "INRA5" "ETH225" "ILSTS5" ...
## ..@ all.names:List of 15
## .. ..$ INRA63: chr [1:7] "183" "181" "177" "175" ...
## .. ..$ INRA5 : chr [1:5] "137" "141" "143" "139" ...
## .. ..$ ETH225: chr [1:12] "147" "157" "139" "141" ...
## .. ..$ ILSTS5: chr [1:5] "190" "186" "194" "184" ...
## .. ..$ HEL5 : chr [1:10] "149" "151" "163" "165" ...
## .. ..$ HEL1 : chr [1:9] "103" "109" "105" "107" ...
## .. ..$ INRA35: chr [1:7] "102" "104" "120" "110" ...
## .. ..$ ETH152: chr [1:7] "191" "195" "197" "193" ...
## .. ..$ INRA23: chr [1:12] "213" "215" "197" "199" ...
## .. ..$ ETH10 : chr [1:9] "209" "211" "207" "217" ...
## .. ..$ HEL9 : chr [1:11] "153" "159" "165" "155" ...
## .. ..$ CSSM66: chr [1:12] "181" "189" "185" "193" ...
## .. ..$ INRA32: chr [1:12] "162" "178" "180" "184" ...
## .. ..$ ETH3 : chr [1:11] "117" "125" "115" "127" ...
## .. ..$ BM2113: chr [1:14] "140" "142" "122" "134" ...
## ..@ ploidy : int [1:160] 2 2 2 2 2 2 2 2 2 2 ...
## ..@ type : chr "codom"
## ..@ other : NULL
## ..@ call : language read.genepop(file = "data/microbov_small.gen", ncode = 3, quiet = TRUE)
## ..@ pop : Factor w/ 7 levels "AFBIBOR9522",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..@ strata : NULL
## ..@ hierarchy: NULL
or for even more geeky details:
print.AsIs(myData)
As you will see, the genind object is a particular kind of list. Technically it is an S4 object. Instead of accessing elements with $, the elements of S4 objects (called slots) are typically accessed with @ (even though the programmer behind adegenet made it possible to use $ anyway).
Since manipulating such objects is a little complicated, you should use the accessors whenever available:
nInd(myData) # Number of individuals
## [1] 160
nLoc(myData) # Number of loci
## [1] 15
nPop(myData) # Number of populations
## [1] 7
locNames(myData) # Names of loci
## [1] "INRA63" "INRA5" "ETH225" "ILSTS5" "HEL5" "HEL1" "INRA35" "ETH152" "INRA23" "ETH10" "HEL9" "CSSM66" "INRA32" "ETH3" "BM2113"
alleles(myData) # List of all alleles
## $INRA63
## [1] "183" "181" "177" "175" "179" "185" "171"
##
## $INRA5
## [1] "137" "141" "143" "139" "149"
##
## $ETH225
## [1] "147" "157" "139" "141" "153" "149" "155" "143" "159" "151" "137" "145"
##
## $ILSTS5
## [1] "190" "186" "194" "184" "182"
##
## $HEL5
## [1] "149" "151" "163" "165" "167" "155" "157" "153" "161" "159"
##
## $HEL1
## [1] "103" "109" "105" "107" "101" "117" "113" "111" "115"
##
## $INRA35
## [1] "102" "104" "120" "110" "108" "114" "106"
##
## $ETH152
## [1] "191" "195" "197" "193" "201" "199" "205"
##
## $INRA23
## [1] "213" "215" "197" "199" "209" "205" "211" "203" "207" "217" "201" "193"
##
## $ETH10
## [1] "209" "211" "207" "217" "219" "221" "215" "223" "213"
##
## $HEL9
## [1] "153" "159" "165" "155" "161" "163" "167" "157" "149" "169" "151"
##
## $CSSM66
## [1] "181" "189" "185" "193" "197" "183" "199" "187" "179" "195" "171" "191"
##
## $INRA32
## [1] "162" "178" "180" "184" "202" "182" "176" "174" "164" "160" "204" "168"
##
## $ETH3
## [1] "117" "125" "115" "127" "103" "123" "129" "131" "119" "109" "113"
##
## $BM2113
## [1] "140" "142" "122" "134" "136" "146" "138" "130" "124" "150" "128" "132" "126" "144"
nAll(myData, onlyObserved = TRUE) # Number of alleles for each locus
## INRA63 INRA5 ETH225 ILSTS5 HEL5 HEL1 INRA35 ETH152 INRA23 ETH10 HEL9 CSSM66 INRA32 ETH3 BM2113
## 7 5 12 5 10 9 7 7 12 9 11 12 12 11 14
indNames(myData) # Names of individuals
## [1] "AFBIBOR9503" "AFBIBOR9504" "AFBIBOR9505" "AFBIBOR9506" "AFBIBOR9507" "AFBIBOR9508" "AFBIBOR9509" "AFBIBOR9510" "AFBIBOR9511" "AFBIBOR9512"
## [11] "AFBIBOR9513" "AFBIBOR9514" "AFBIBOR9515" "AFBIBOR9516" "AFBIBOR9517" "AFBIBOR9518" "AFBIBOR9519" "AFBIBOR9520" "AFBIBOR9523" "AFBIBOR9521"
## [21] "AFBIBOR9522" "AFBIZEB9453" "AFBIZEB9454" "AFBIZEB9455" "AFBIZEB9456" "AFBIZEB9457" "AFBIZEB9458" "AFBIZEB9459" "AFBIZEB9460" "AFBIZEB9461"
## [31] "AFBIZEB9462" "AFBIZEB9463" "AFBIZEB9464" "AFBIZEB9465" "AFBIZEB9466" "AFBIZEB9467" "AFBIZEB9468" "AFBIZEB9469" "AFBIZEB9470" "AFBIZEB9471"
## [41] "AFBIZEB9472" "AFBIZEB9473" "AFBIZEB9474" "AFBIZEB9475" "AFBIZEB9476" "AFBIZEB9477" "AFBIZEB9478" "AFBIZEB9479" "AFBIZEB9480" "AFBIZEB9481"
## [51] "AFBIZEB9482" "AFBIZEB9483" "AFBTLAG9402" "AFBTLAG9403" "AFBTLAG9404" "AFBTLAG9405" "AFBTLAG9406" "AFBTLAG9407" "AFBTLAG9408" "AFBTLAG9409"
## [61] "AFBTLAG9410" "AFBTLAG9411" "AFBTLAG9412" "AFBTLAG9413" "AFBTLAG9414" "AFBTLAG9415" "AFBTLAG9416" "AFBTND202" "AFBTND205" "AFBTND206"
## [71] "AFBTND207" "AFBTND208" "AFBTND209" "AFBTND211" "AFBTND212" "AFBTND213" "AFBTND214" "AFBTND215" "AFBTND216" "AFBTND217"
## [81] "AFBTND221" "AFBTND222" "AFBTND223" "AFBTND233" "AFBTND241" "AFBTND242" "AFBTND244" "AFBTND248" "AFBTND253" "AFBTND254"
## [91] "AFBTND255" "AFBTND257" "AFBTND258" "AFBTND259" "AFBTND284" "AFBTND285" "AFBTND292" "AFBTSOM9352" "AFBTSOM9353" "AFBTSOM9354"
## [101] "AFBTSOM9355" "AFBTSOM9356" "AFBTSOM9357" "AFBTSOM9358" "AFBTSOM9359" "AFBTSOM9360" "AFBTSOM9361" "AFBTSOM9362" "AFBTSOM9363" "AFBTSOM9364"
## [111] "AFBTSOM9365" "AFBTSOM9366" "AFBTSOM9367" "AFBTSOM9368" "AFBTSOM9369" "AFBTSOM9370" "AFBTSOM9371" "AFBTSOM9372" "AFBTSOM9373" "AFBTSOM9374"
## [121] "FRBTAUB9061" "FRBTAUB9062" "FRBTAUB9063" "FRBTAUB9064" "FRBTAUB9065" "FRBTAUB9066" "FRBTAUB9067" "FRBTAUB9068" "FRBTAUB9069" "FRBTAUB9070"
## [131] "FRBTAUB9071" "FRBTAUB9072" "FRBTAUB9073" "FRBTAUB9074" "FRBTAUB9075" "FRBTAUB9076" "FRBTBAZ15654" "FRBTBAZ15655" "FRBTBAZ25576" "FRBTBAZ25578"
## [141] "FRBTBAZ25950" "FRBTBAZ25954" "FRBTBAZ25956" "FRBTBAZ25957" "FRBTBAZ26078" "FRBTBAZ26092" "FRBTBAZ26097" "FRBTBAZ26110" "FRBTBAZ26112" "FRBTBAZ26352"
## [151] "FRBTBAZ26354" "FRBTBAZ26375" "FRBTBAZ26386" "FRBTBAZ26388" "FRBTBAZ26396" "FRBTBAZ26400" "FRBTBAZ26401" "FRBTBAZ26403" "FRBTBAZ26439" "FRBTBAZ26457"
popNames(myData) # Names of populations (genepop files default to the last individual in each pop)
## [1] "AFBIBOR9522" "AFBIZEB9483" "AFBTLAG9416" "AFBTND292" "AFBTSOM9374" "FRBTAUB9076" "FRBTBAZ26457"
Some of the accessors can also be used to redefine some information (to handle with great care!):
#let's give the pops some names:
#myPops <- c("Pop1", "Pop2", "Pop3", "Pop4", "Pop5", "Pop6", "Pop7")
myPops <- paste0("Pop", 1:nPop(myData))
popNames(myData) <- myPops
popNames(myData)
## [1] "Pop1" "Pop2" "Pop3" "Pop4" "Pop5" "Pop6" "Pop7"
pop(myData)
## [1] Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2
## [32] Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3
## [63] Pop3 Pop3 Pop3 Pop3 Pop3 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4
## [94] Pop4 Pop4 Pop4 Pop4 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop6 Pop6 Pop6 Pop6
## [125] Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7
## [156] Pop7 Pop7 Pop7 Pop7 Pop7
## Levels: Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
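Since pop(myData) returns a factor, base R's table() gives you the group sizes directly; here is the idea on a toy factor (with the course data you would simply call table(pop(myData))):

```r
# table() counts how many individuals belong to each level of a factor
pops <- factor(c("Pop1", "Pop1", "Pop2", "Pop2", "Pop2"))
table(pops)
```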
data.frame format

Not all R packages dealing with genetics/genomics use genind objects (it would be convenient if they did!). The package hierfstat, for example, just uses a plain data.frame in which the first column contains the population to which each individual belongs, and the following columns contain the genotypes of the individuals (one locus per column).
In general there are functions to convert one format into another. To convert the genind objects into a data.frame you can use genind2hierfstat() from the package hierfstat:
myData_hierf <- genind2hierfstat(myData)
head(myData_hierf) # Display first 6 rows
loci format

The package pegas, which allows for the computation of many useful metrics, relies on an alternative format called loci.
myData_loci <- genind2loci(myData) # or as.loci(myData)
myData_loci
The loci format is simpler than the genind format, but a little more complex than just a naked data.frame.
Geeky note: A loci object is a data.frame (an S3 object) with an additional attribute called "locicol":
str(myData_loci)
## Classes 'loci' and 'data.frame': 160 obs. of 16 variables:
## $ population: Factor w/ 7 levels "Pop1","Pop2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ INRA63 : Factor w/ 18 levels "171/175","175/175",..: 17 16 10 17 10 10 9 17 10 17 ...
## $ INRA5 : Factor w/ 11 levels "137/137","137/139",..: 3 9 9 9 9 4 6 6 6 9 ...
## $ ETH225 : Factor w/ 29 levels "137/147","139/139",..: 19 8 2 10 25 23 19 26 3 28 ...
## $ ILSTS5 : Factor w/ 13 levels "182/184","182/186",..: 11 8 13 6 5 5 6 5 3 8 ...
## $ HEL5 : Factor w/ 28 levels "149/149","151/151",..: 1 6 7 28 17 17 27 26 26 28 ...
## $ HEL1 : Factor w/ 22 levels "101/101","101/103",..: 7 10 4 5 6 5 4 6 5 5 ...
## $ INRA35 : Factor w/ 13 levels "102/102","102/104",..: 2 4 4 4 4 4 4 2 4 4 ...
## $ ETH152 : Factor w/ 14 levels "191/191","191/195",..: 2 7 8 7 1 8 7 8 2 8 ...
## $ INRA23 : Factor w/ 32 levels "193/199","197/199",..: 31 3 4 11 11 4 26 20 19 9 ...
## $ ETH10 : Factor w/ 25 levels "207/207","207/209",..: 8 5 3 12 8 11 7 11 6 6 ...
## $ HEL9 : Factor w/ 29 levels "149/157","151/161",..: 3 6 9 14 6 18 9 4 4 23 ...
## $ CSSM66 : Factor w/ 34 levels "171/183","179/179",..: 10 10 8 33 24 18 22 20 20 20 ...
## $ INRA32 : Factor w/ 27 levels "160/204","162/162",..: 2 2 19 2 6 27 19 21 5 7 ...
## $ ETH3 : Factor w/ 30 levels "103/117","109/117",..: 10 13 8 10 9 10 9 6 25 5 ...
## $ BM2113 : Factor w/ 43 levels "122/122","122/124",..: 41 40 1 41 30 41 28 30 7 37 ...
## - attr(*, "locicol")= int [1:15] 2 3 4 5 6 7 8 9 10 11 ...
head(data.frame(myData_loci), n = 10) # first 10 rows
attr(myData_loci, "locicol") # columns that contain loci
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Let’s check some basic things about the data in myData.
An important consideration for analysis is the amount of missing data in our dataset. Several types of analyses don’t cope very well with missing data. In some cases this may lead to poor estimates, in other cases it may lead to samples or loci not being considered in the analysis.
We can use info_table() from poppr to find out about missing data, and also generate a convenient plot by including plot = TRUE.
missing <- info_table(myData, df = TRUE) # df = TRUE uses a long rather than
# wide format for the df
missing[order(missing$Missing, decreasing = TRUE), ] # We reorder the table to show the highest missing on top
info_table(myData, plot = TRUE, low = "white") ## NB: run in the console for best display
## Locus
## Population INRA63 INRA5 ETH225 ILSTS5 HEL5 HEL1 INRA35 ETH152 INRA23 ETH10 HEL9 CSSM66 INRA32 ETH3 BM2113 Mean
## Pop1 . . . 0.0476 . . . . . . . . . . . 0.0032
## Pop2 . . . . 0.0323 . . . 0.0645 . . . 0.0323 . . 0.0086
## Pop3 . . . 0.3333 . . . 0.0667 . . . . . . . 0.0267
## Pop4 . . . 0.1333 . 0.0333 0.0333 . 0.0333 . . . . . . 0.0156
## Pop5 . . 0.0435 0.1739 . . . . . . . . . 0.0435 . 0.0174
## Pop6 . . . 0.2500 . . . . . . . . . . . 0.0167
## Pop7 0.1667 0.2917 . 0.2500 0.0833 0.2083 . 0.2917 0.0417 0.3333 0.0417 0.2917 0.3750 0.3750 0.2083 0.1972
## Total 0.0250 0.0437 0.0063 0.1500 0.0187 0.0375 0.0063 0.0500 0.0250 0.0500 0.0063 0.0437 0.0625 0.0625 0.0312 0.0413
Geeky note: it is tricky, but you can modify anything in this plot even when info_table() has no option for it:
theplot <- ggplot_build(last_plot())
theplot$data[[4]]$size <- 3
theplot$data[[4]]$angle <- 45
plot(ggplot_gtable(theplot))
We can see that there is missing data at several loci, and that this is particularly bad for our population Pop7.
As a general rule of thumb, we try to keep missing data below 5% in order to minimize the impact on analyses.
How do you think we can achieve this?
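One way is to flag loci (or populations) exceeding the threshold and remove them. Here is the filtering logic sketched in base R on a toy genotype matrix (NA = missing; the matrix and locus names are made up):

```r
# Proportion of missing genotypes per locus; loci above 5% missing are
# candidates for removal (toy matrix: 4 individuals x 3 loci)
geno <- matrix(c(1, 2, NA, 2,
                 1, 1, NA, NA,
                 2, 1, 2, 2), nrow = 4)
colnames(geno) <- c("locA", "locB", "locC")
miss_per_locus <- colMeans(is.na(geno))
names(miss_per_locus[miss_per_locus > 0.05]) # loci exceeding 5% missing
```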
Using genotype_curve() we can check how many loci we need in order to discriminate our genotypes. In other words, we can use it to answer the question: how many loci do we need to distinguish all our samples?
From 1 to \(n-1\), where \(n\) is the total number of loci, this poppr function randomly samples columns (= loci) 100 times without replacement, and counts the number of unique genotypes observed.
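The resampling logic can be sketched in base R; geno here is a hypothetical individuals-by-loci genotype matrix, not the course data, and the function is our own simplified version, not the poppr implementation:

```r
# For each number of loci k (1 to n-1), repeatedly sample k loci at random
# and count how many distinct multi-locus genotypes can be distinguished
accumulation <- function(geno, reps = 100) {
  n_loci <- ncol(geno)
  sapply(seq_len(n_loci - 1), function(k) {
    replicate(reps, {
      cols <- sample(n_loci, k)
      nrow(unique(geno[, cols, drop = FALSE]))
    })
  })
}
```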
gencurv <- genotype_curve(myData)
How many loci do we need in order to discriminate our genotypes?
Why is it run to \(n-1\), and not to \(n\)?
Geeky note: you can get the exact numbers of distinct genotypes given each number of loci considered:
apply(gencurv, 2, range)
## NumLoci
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## [1,] 12 44 96 132 146 154 156 156 157 157 157 157 157 157
## [2,] 44 131 154 157 157 157 157 157 157 157 157 157 157 157
As missing data is concentrated in a few loci and populations, rather than being thinly spread throughout the dataset, we can remove the worst offenders: a locus (ILSTS5) and a population (Pop7).
We can ‘drop’ a population from the genind object using the following code, if we know the number (ID) of the population in the list of populations.
myData[pop = -c(7), drop = TRUE] # drop = TRUE updates the number of remaining alleles!
Sometimes it is more useful to be able to remove populations by their name. For this we need a bit more code:
removePop <- c("Pop7")
myData <- myData[pop = !popNames(myData) %in% removePop, drop = TRUE]
Removing loci from a genind object works in a similar fashion as removing a population. If you know the number (ID) of a locus, then you can use the short code myData[loc = -c(1)]. But again, it is better to use names.
removeLoc <- c("ILSTS5")
myData <- myData[loc = !locNames(myData) %in% removeLoc, drop = TRUE]
Now let’s check the new version of myData:
myData
## /// GENIND OBJECT /////////
##
## // 136 individuals; 14 loci; 137 alleles; size: 111.1 Kb
##
## // Basic content
## @tab: 136 x 137 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 5-14)
## @loc.fac: locus factor for the 137 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, loc = ..1, drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 15-31)
Let’s also rerun info_table() to see what’s changed:
info_table(myData, plot = TRUE, low = "white")
## Locus
## Population INRA63 INRA5 ETH225 HEL5 HEL1 INRA35 ETH152 INRA23 ETH10 HEL9 CSSM66 INRA32 ETH3 BM2113 Mean
## Pop1 . . . . . . . . . . . . . . .
## Pop2 . . . 0.0323 . . . 0.0645 . . . 0.0323 . . 0.0092
## Pop3 . . . . . . 0.0667 . . . . . . . 0.0048
## Pop4 . . . . 0.0333 0.0333 . 0.0333 . . . . . . 0.0071
## Pop5 . . 0.0435 . . . . . . . . . 0.0435 . 0.0062
## Pop6 . . . . . . . . . . . . . . .
## Total . . 0.0074 0.0074 0.0074 0.0074 0.0074 0.0221 . . . 0.0074 0.0074 . 0.0053
We have significantly reduced the amount of missing data in myData.
How else could we have solved this?
How about our ability to discriminate genotypes?
genotype_curve(myData)
How does that compare to the previous genotype accumulation curve?
How does the maximum value for MLG compare to the number of individuals?
Depending on how we obtain our genetic samples, we may not know if we have genotyped the same individual more than once.
For example, if we have been genotyping samples collected in a non-invasive way (e.g. hair or faeces), then we don’t really have a way of knowing if we have obtained more than one sample per individual.
We know how many loci we need to distinguish individuals by using genotype_curve() above, so now we can check if we have genotyped an individual more than once.
There is an R package called allelematch that can investigate this in detail; however, it is not actively maintained. Thankfully, poppr has some basic functionality to examine this. Using mlg() we can check the number of unique multilocus genotypes in myData.
Note that all software dealing with genetic data has a way of labelling missing data; for example, many R packages use NA. Missing data can contribute to observed differences between genotypes, but such differences only reflect the absence of data and not necessarily a biological difference. The mlg() function is therefore coded to ignore differences between genotypes that result from missing data.
mlg(myData)
## #############################
## # Number of Individuals: 136
## # Number of MLG: 133
## #############################
## [1] 133
We have 136 samples, but only 133 unique multilocus genotypes. This suggests that 3 samples correspond to replicates from the same individual(s). (We introduced that in the data on purpose to show you how to deal with such issues.)
Note: the mlg() function did not tell us which samples are the same. For this we can use mlg.id(), but this gives us a very long list because it also includes the genotypes that occur only once. Let us check the beginning of the output from mlg.id() using head():
head(mlg.id(myData))
## $`1`
## [1] "AFBIZEB9466"
##
## $`2`
## [1] "FRBTAUB9073"
##
## $`3`
## [1] "AFBTSOM9356"
##
## $`4`
## [1] "AFBIZEB9471"
##
## $`5`
## [1] "FRBTAUB9067"
##
## $`6`
## [1] "FRBTAUB9064"
What we want to know is: which genotypes occur more than once?
We achieve this by looking for entries in the list with a length greater than one.
myMLG <- mlg.id(myData)
myMLG[lengths(myMLG) > 1]
## $`104`
## [1] "AFBIZEB9482" "AFBIZEB9483"
##
## $`120`
## [1] "AFBIBOR9520" "AFBIBOR9523"
##
## $`125`
## [1] "AFBTSOM9373" "AFBTSOM9374"
As we do not want to include replicates of samples in our data, we should now remove one of each replicated genotype. We will remove AFBIZEB9483, AFBIBOR9523, and AFBTSOM9374. We can follow the same principle we used to remove loci and a population earlier:
removeInd <- c("AFBIZEB9483", "AFBIBOR9523","AFBTSOM9374")
myData <- myData[!indNames(myData) %in% removeInd, drop = TRUE]
myData
## /// GENIND OBJECT /////////
##
## // 133 individuals; 14 loci; 137 alleles; size: 109.1 Kb
##
## // Basic content
## @tab: 133 x 137 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 5-14)
## @loc.fac: locus factor for the 137 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 15-30)
Now we have 133 unique genotypes in myData.
If this were data from non-invasively collected samples, we may expect there to be more duplicate genotypes that we have not detected yet. Why?
Geeky note:
If you have to remove a lot of samples, you can do it easily by replacing the previous call by:
samplesToKeep <- unlist(lapply(myMLG, function(x) x[1])) # Capture the first occurrence of each MLG
myData <- myData[indNames(myData) %in% samplesToKeep, drop = TRUE]
info_table() can also be used to find out about the ploidy of your data. As the results are presented for all samples, here are just the first couple of lines of output (obtained using head()):
head(info_table(myData, type = "ploidy"), n = 2)
## Loci
## Samples INRA63 INRA5 ETH225 HEL5 HEL1 INRA35 ETH152 INRA23 ETH10 HEL9 CSSM66 INRA32 ETH3 BM2113
## AFBIBOR9503 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## AFBIBOR9504 2 2 2 2 2 2 2 2 2 2 2 2 2 2
How many alleles would we expect in a tetraploid?
Could this number be 1 for a diploid?
We can summarize the results from this large table by counting the different outcomes in it:
table(info_table(myData, type = "ploidy"), useNA = "always") # Counts different values
##
## 2 <NA>
## 1852 10
The numbers make sense:
nLoc(myData) * nInd(myData)
## [1] 1862
Another important consideration is to check if our loci are variable and informative. Otherwise it is useless to keep them.
What do we mean by uninformative? The simplest example is when all individuals are identical at a given locus. This might be due to experimental design, e.g. if we designed our microsatellite primers using genetic data from another species (say, using genetic data from the cat to design primers to genotype lions). But there are also processes working at the population level that could lead to this.
Can you think of an example?
There is a common way to examine if a locus is informative. For this, we examine the so-called minor allele frequency (MAF): the frequency of the second most common allele at a locus. If this is higher than a threshold (often 1% = frequency of 0.01), a locus is considered informative.
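The MAF computation itself is simple; here is a base-R sketch with made-up allele counts (the function maf() is ours, for illustration only):

```r
# Minor allele frequency: frequency of the 2nd most common allele
maf <- function(counts) {
  freqs <- counts / sum(counts)
  unname(sort(freqs, decreasing = TRUE)[2])
}
maf(c(A = 190, a = 10)) # 0.05, i.e. informative at a 1% threshold
```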
The function informloci() can be used to detect uninformative loci and remove them.
informloci(myData)
## cutoff value: 1.50375939849624 % ( 2 samples ).
## MAF : 0.01
##
## All sites polymorphic
## /// GENIND OBJECT /////////
##
## // 133 individuals; 14 loci; 137 alleles; size: 109.2 Kb
##
## // Basic content
## @tab: 133 x 137 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 5-14)
## @loc.fac: locus factor for the 137 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, loc = ..1, drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 15-30)
The important information we just obtained is the message “All sites polymorphic”. This implies that, in our case, all loci are informative and they are all kept. If we had uninformative loci and wanted to remove them, we would have run the following:
myData <- informloci(myData)
You can easily explore genetic differences between individuals – within each population – using our home-made little function:
pairwise_similarity(myData, pop = "Pop1")
We can represent the results visually:
lapply(popNames(myData), function(pop) {
  hist(pairwise_similarity(myData, pop = pop, as_vector = TRUE),
       main = pop, xlim = c(0, 1)) # don't forget as_vector = TRUE
})
How would you do the same thing across all populations?
hist(pairwise_similarity(myData, as_vector = TRUE), xlim = c(0, 1)) # don't forget as_vector = TRUE
Consider the following two populations with different genotype frequencies:

       AA   Aa   aa
Pop1   50    0   50
Pop2   25   50   25
Do the genotype frequencies of the two populations differ?
Do the allele frequencies of the two populations differ?
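To check your answer for the allele frequencies, you can compute them directly from the genotype counts above (each AA individual carries two copies of A, each Aa carries one):

```r
# Allele frequency of A in each population, from the genotype counts
geno <- rbind(Pop1 = c(AA = 50, Aa = 0, aa = 50),
              Pop2 = c(AA = 25, Aa = 50, aa = 25))
freq_A <- (2 * geno[, "AA"] + geno[, "Aa"]) / (2 * rowSums(geno))
freq_A # 0.5 in both populations, despite the different genotype frequencies
```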
A population is at Hardy-Weinberg equilibrium (HWE) at a locus when the following assumptions hold:
Genotype frequencies are the same in males and females.
Individuals mate at random with respect to their genotype at this particular locus (panmixia).
Meiosis is fair.
There is no new genetic material (no mutation).
There is no gene flow (no migration).
The population is of infinite size (no drift).
All matings produce the same average number of offspring (no selection on fecundity).
There are no differences among genotypes in the probability of survival (no selection on survival).
Generations do not overlap (no selection on reproductive rate).
Using the frequencies of alleles in the current generation …
\(p_t = f(A)\)
\(q_t = f(a)\)
… the frequencies in the next generation can be calculated:
\(f(AA) = p_t^{2}\)
\(f(Aa) = 2p_tq_t\)
\(f(aa) = q_t^{2}\)
\(p_{t+1} = f(AA) + \dfrac{f(Aa)}{2} = p_t^2 + \dfrac{2p_tq_t}{2} = p_t^2 + p_tq_t = p_t^2 + p_t(1-p_t) = p_t^2 + p_t - p_t^2 = p_t\)
\(q_{t+1} = f(aa) + \dfrac{f(Aa)}{2} = q_t^2 + \dfrac{2p_tq_t}{2} = q_t^2 + p_tq_t = q_t^2 + (1-q_t)q_t = q_t^2 + q_t - q_t^2 = q_t\)
Under HWE, the allelic frequencies remain the same over time!
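You can verify this numerically for any starting frequency, e.g. \(p_t = 0.3\):

```r
# One generation of random mating: allele frequencies are unchanged
p <- 0.3
q <- 1 - p
f_AA <- p^2; f_Aa <- 2 * p * q; f_aa <- q^2
p_next <- f_AA + f_Aa / 2
q_next <- f_aa + f_Aa / 2
c(p_next, q_next) # identical to p and q
```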
Evolution is a change in allele frequencies in a population over time.
When a locus in a population is in HWE, it is not evolving: allele frequencies will stay the same across generations.
If HWE assumptions are not met, evolution can happen (allele frequencies may change).
Mutation, non-random mating, gene flow, genetic drift (caused by finite population size), and natural selection violate HWE assumptions and are thus “mechanisms” by which evolution may proceed.
The HWE is thus the null model of micro-evolution.
So is it good/bad for a locus to be in HWE?
Very broadly speaking, there are two types of population genetics analyses you can do:
those that assume loci are in HWE, and
those that do not.
It is thus important to realise if your data reject the HWE assumptions!
With hw.test() from pegas we can test for HWE across our populations:
hw.test(myData, B = 0) ## for Monte Carlo test instead of asymptotic,
## use large value of B (>= 1000)
## chi^2 df Pr(chi^2 >)
## INRA63 88.42099 21 3.017570e-10
## INRA5 42.19441 10 6.924324e-06
## ETH225 124.64277 66 1.745822e-05
## HEL5 180.45114 45 0.000000e+00
## HEL1 184.99600 28 0.000000e+00
## INRA35 60.39321 21 1.113412e-05
## ETH152 47.09923 21 9.106719e-04
## INRA23 77.80636 66 1.516741e-01
## ETH10 70.74219 36 4.795813e-04
## HEL9 130.20509 55 4.867201e-08
## CSSM66 82.65631 66 8.075179e-02
## INRA32 209.96734 66 1.110223e-16
## ETH3 44.98197 55 8.303643e-01
## BM2113 184.20731 91 2.747856e-08
Here we have conducted a \(\chi^{2}\)-test based on observed and expected genotype frequencies calculated from the allele frequencies.
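For a single biallelic locus, the same test can be written out by hand (the genotype counts below are hypothetical; with 3 genotype classes and 1 estimated allele frequency, df = 3 - 1 - 1 = 1):

```r
# Hand-rolled HWE chi-square test for one biallelic locus
obs <- c(AA = 30, Aa = 20, aa = 50)                 # hypothetical counts
n <- sum(obs)
p <- unname((2 * obs["AA"] + obs["Aa"]) / (2 * n))  # frequency of A
expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)  # HWE expectations
chi2 <- sum((obs - expected)^2 / expected)
pchisq(chi2, df = 1, lower.tail = FALSE)            # p-value of the test
```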
Would we expect loci to be in HWE if we consider all populations together as we have done?
The Wahlund effect refers to the reduction in the number of heterozygotes due to population structure.
Consider two populations:
       AA   Aa   aa
Pop1   50    0    0
Pop2    0    0   50
Here, each population is in HWE.
However, if we treat them as a single population, there are no heterozygotes, and this merged population is not in HWE.
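The heterozygote deficit is easy to quantify: in the pooled sample the allele frequency of A is 0.5, so HWE would predict half of the individuals to be heterozygous:

```r
# Observed vs HWE-expected heterozygotes in the merged population
pooled <- c(AA = 50, Aa = 0, aa = 50)
n <- sum(pooled)
p <- unname((2 * pooled["AA"] + pooled["Aa"]) / (2 * n)) # 0.5
expected_Aa <- 2 * p * (1 - p) * n # 50 heterozygotes expected, 0 observed
```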
Let’s check our data by population. We will split myData by population using the seppop() function. This creates a list of genind objects, with each entry in the list corresponding to one population.
To apply the HWE test with hw.test() to every item in the list (i.e. every population), we use (as above) the function lapply().
lapply(seppop(myData), hw.test)
## $Pop1
## chi^2 df Pr(chi^2 >) Pr.exact
## INRA63 0.8333333 6 0.99114998 1.000
## INRA5 12.9166667 6 0.04437854 0.008
## ETH225 21.4390432 36 0.97397104 0.882
## HEL5 37.7654321 28 0.10291932 0.238
## HEL1 16.0969388 15 0.37563846 0.404
## INRA35 10.0888889 6 0.12095828 0.195
## ETH152 10.2835732 10 0.41597667 0.088
## INRA23 19.0958270 21 0.57899284 0.205
## ETH10 13.9506173 21 0.87171109 0.824
## HEL9 26.7469136 28 0.53205905 0.738
## CSSM66 29.6798155 15 0.01312988 0.181
## INRA32 48.1087311 36 0.08546174 0.066
## ETH3 3.8776014 10 0.95269969 0.971
## BM2113 18.1530612 21 0.63929854 0.549
##
## $Pop2
## chi^2 df Pr(chi^2 >) Pr.exact
## INRA63 45.692857 21 1.403507e-03 0.000
## INRA5 19.552921 10 3.377642e-02 0.032
## ETH225 54.411111 28 1.999921e-03 0.000
## HEL5 59.525763 21 1.506288e-05 0.000
## HEL1 3.949720 10 9.495867e-01 0.897
## INRA35 13.606771 10 1.916952e-01 0.030
## ETH152 11.280563 10 3.360814e-01 0.075
## INRA23 9.775034 21 9.816874e-01 0.742
## ETH10 13.687899 15 5.493191e-01 0.223
## HEL9 52.909954 45 1.952332e-01 0.201
## CSSM66 27.401167 36 8.478470e-01 0.591
## INRA32 87.022377 36 4.114494e-06 0.131
## ETH3 11.089471 15 7.462265e-01 0.493
## BM2113 19.367498 28 8.864145e-01 0.882
##
## $Pop3
## chi^2 df Pr(chi^2 >) Pr.exact
## INRA63 5.94687500 10 0.81970664 0.546
## INRA5 7.60416667 6 0.26856035 0.095
## ETH225 7.54933378 3 0.05630431 0.065
## HEL5 9.13333333 10 0.51949831 0.191
## HEL1 0.60000000 1 0.43857803 1.000
## INRA35 3.30740741 3 0.34661300 0.137
## ETH152 3.62388427 1 0.05695575 0.065
## INRA23 0.80740741 3 0.84769447 1.000
## ETH10 0.80740741 3 0.84769447 1.000
## HEL9 2.76962525 6 0.83715593 0.841
## CSSM66 1.51859504 1 0.21783223 0.244
## INRA32 0.01783591 1 0.89375751 1.000
## ETH3 11.76480716 10 0.30110547 0.342
## BM2113 5.93357821 10 0.82081249 0.876
##
## $Pop4
## chi^2 df Pr(chi^2 >) Pr.exact
## INRA63 37.158133 10 5.313702e-05 0.035
## INRA5 1.228844 3 7.460949e-01 0.494
## ETH225 29.540706 21 1.016135e-01 0.088
## HEL5 32.331967 28 2.612127e-01 0.418
## HEL1 57.190924 21 3.366698e-05 0.000
## INRA35 3.473136 3 3.242630e-01 0.264
## ETH152 2.228763 6 8.975031e-01 0.775
## INRA23 27.404740 21 1.578545e-01 0.099
## ETH10 20.406585 15 1.568841e-01 0.003
## HEL9 23.035714 15 8.338452e-02 0.006
## CSSM66 18.778597 36 9.920040e-01 0.413
## INRA32 7.384172 3 6.061047e-02 0.046
## ETH3 34.353741 10 1.608659e-04 0.039
## BM2113 13.117038 21 9.044692e-01 0.688
##
## $Pop5
## chi^2 df Pr(chi^2 >) Pr.exact
## INRA63 13.791312 15 0.5414118 0.530
## INRA5 7.588642 6 0.2698152 0.057
## ETH225 25.176667 21 0.2395936 0.144
## HEL5 4.665958 10 0.9123471 0.918
## HEL1 1.454694 3 0.6927657 1.000
## INRA35 2.296331 1 0.1296800 0.218
## ETH152 1.845937 6 0.9333083 0.839
## INRA23 16.209393 15 0.3682734 0.202
## ETH10 11.739444 15 0.6986344 0.193
## HEL9 40.562233 36 0.2761318 0.325
## CSSM66 21.933673 21 0.4033407 0.229
## INRA32 4.361975 10 0.9295439 0.468
## ETH3 32.563125 28 0.2522038 0.074
## BM2113 12.740575 10 0.2385404 0.135
##
## $Pop6
## chi^2 df Pr(chi^2 >) Pr.exact
## INRA63 0.117551 1 7.317059e-01 1.000
## INRA5 2.346667 3 5.036398e-01 0.638
## ETH225 9.120676 10 5.206906e-01 0.379
## HEL5 8.699421 28 9.998207e-01 0.985
## HEL1 34.147448 10 1.743706e-04 0.121
## INRA35 34.863905 6 4.579186e-06 0.006
## ETH152 3.484444 6 7.460381e-01 0.346
## INRA23 18.391038 15 2.426658e-01 0.102
## ETH10 23.278571 15 7.840172e-02 0.185
## HEL9 30.950400 15 8.920097e-03 0.113
## CSSM66 31.604938 36 6.776680e-01 0.778
## INRA32 3.998186 6 6.769219e-01 0.766
## ETH3 6.064198 21 9.993764e-01 1.000
## BM2113 35.235556 36 5.047583e-01 0.139
Be careful during your interpretation: the population sizes differ and so does the statistical power!
unlist(lapply(seppop(myData), nInd))
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## 20 30 15 30 22 16
Some loci, in some populations, are still not in HWE.
Why may this be?
Another important consideration for analysing genetic data is the independence of loci.
Imagine two loci that are physically very close on the same chromosome. Will alleles at these loci be segregating independently?
Do you know how this is measured?
For this reason, and many others (e.g. Fisher’s runaway), it is possible that alleles at one locus may be associated with alleles at another locus. In other words, we may observe non-random association of alleles at different loci. This is called linkage disequilibrium.
Why is this important? If the dependence between loci is not recognised, the results will be biased.
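As an aside on the question above ("how is this measured?"): for two biallelic loci, LD is classically quantified with the coefficient \(D\) and the squared correlation \(r^2\). A minimal, self-contained sketch with made-up haplotype frequencies (the ia() approach used below instead summarises association over many multiallelic loci):

```r
## Classic pairwise LD measures for two biallelic loci (A/a and B/b).
## Haplotype and allele frequencies below are made up for illustration.
pA <- 0.6; pB <- 0.5
pAB <- 0.4                          # observed frequency of the A-B haplotype
D <- pAB - pA * pB                  # D = 0 under linkage equilibrium
r2 <- D^2 / (pA * (1 - pA) * pB * (1 - pB))
c(D = D, r2 = r2)
```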
We can look at this with the ia() function of poppr. This function calculates an index of association over all loci in the genind object:
ia(myData, sample = 999)
## Ia p.Ia rbarD p.rD
## 0.84836718 0.00100000 0.06567056 0.00100000
Here we can see the distribution of expected association between alleles at different loci when there is no linkage (dark grey barplot), and the estimate for association among alleles in our total dataset (i.e., all loci and all pops at the same time).
It appears we have a significant association between loci, at least when merging observations across all populations.
What if we look at the same test per population? To answer this question, we use seppop() and lapply() as before, but now we include the function that we want to run on each pop (here, ia()) and the arguments for such a function. Here, the argument sample defines the number of permutations used to draw the distribution of the association index under the null hypothesis; we want a large number, e.g., 999:
lapply(seppop(myData), ia, sample = 999)
## $Pop1
## Ia p.Ia rbarD p.rD
## 0.25067155 0.04600000 0.01941745 0.04600000
##
## $Pop2
## Ia p.Ia rbarD p.rD
## 0.018941131 0.502000000 0.001466489 0.502000000
##
## $Pop3
## Ia p.Ia rbarD p.rD
## 0.68851067 0.00100000 0.05471501 0.00100000
##
## $Pop4
## Ia p.Ia rbarD p.rD
## -0.018113328 0.521000000 -0.001413501 0.521000000
##
## $Pop5
## Ia p.Ia rbarD p.rD
## 0.46856847 0.00500000 0.03641272 0.00500000
##
## $Pop6
## Ia p.Ia rbarD p.rD
## -0.29912511 0.98000000 -0.02351494 0.98000000
In some populations we observe significant LD.
Is this important?
If we want to know which two loci are in LD, we can use the pair.ia() function. Let’s consider, for example, what happens for Pop6:
pair.ia(seppop(myData)[["Pop6"]])
#pair.ia(myData[pop = 6]) # same thing
We can see that LD is not the same for all pairs of loci.
Let us change the plot so that it is easier for colour-blind people to see (e.g. for Dan!). For this we add some additional arguments for colour: low = "black" and high = "green".
pair.ia(myData[pop = "Pop6"], low = "black", high = "green")
We now display the linkage disequilibrium for each population:
lapply(seppop(myData), pair.ia, low = "black", high = "green")
There are some typical descriptive statistics you will find in most population genetics papers. These summary statistics give us an overview of some important features of populations under study.
We will focus on the basic ones describing variation within populations and describing variation between populations.
Note: many of these can be calculated over all populations as well as per population or between pairs of populations. This offers information at different scales.
If we are interested in genetic variation in natural populations we often consider heterozygosity.
High heterozygosity means a lot of genetic variability.
Low heterozygosity means little genetic variability.
Let’s consider heterozygosity in a simple system with two alleles at a locus: A and a. Let’s also assume that this population is in HWE.
Then:
\(p = f(A)\)
\(q = f(a)\)
So under HWE, we obtain the following mating table:
|   | A       | a       |
|---|---------|---------|
| A | \(p^2\) | \(pq\)  |
| a | \(qp\)  | \(q^2\) |
So \(pq + qp = 2pq\) gives the frequency of heterozygote genotypes.
In this two-allele system, heterozygosity is highest at \(p = 0.5\). Let’s visualize:
freqP <- seq(from = 0, to = 1, by = 0.01)
freqQ <- 1 - freqP
plot(2*freqP*freqQ ~ freqQ, type = "l", las = 1,
ylab = "frequency of heterozygote genotypes",
xlab = "frequency of allele 'a'")
text(0, 0, "AA")
text(1, 0, "aa")
arrows(0.45, 0, 0.05, 0)
text(0.5, 0, "Aa")
arrows(0.55, 0, 0.95, 0)
As you can imagine, this becomes more complex when there are more alleles per locus!
Questions?
Now we will calculate the observed heterozygosity in our populations, as well as the expected heterozygosity if the populations are in HWE.
For this we use the summary() function from adegenet. We can extract the observed heterozygosity using $Hobs and the expected heterozygosity using $Hexp.
To make life convenient, you can extract the values for every locus, place them in a data.frame and compute the difference:
heteroz <- data.frame(Hexp = summary(myData)$Hexp, Hobs = summary(myData)$Hobs)
heteroz$diff <- heteroz$Hexp - heteroz$Hobs
heteroz
We can also visualize how observed heterozygosity is different from expected heterozygosity by subtracting one from the other.
If you remember from before, we saw that loci were not in HWE over all populations.
barplot(heteroz$diff, names.arg = rownames(heteroz),
main = "Heterozygosity: expected-observed",
xlab = "", ylab = "Hexp - Hobs", font.lab = 2, las = 2)
Or we can use another representation, using ggplot2 for a change:
heteroz$loci <- rownames(heteroz) ## ggplot needs names stored as a column
ggplot(heteroz, aes(y = Hexp, x = Hobs)) +
geom_segment(aes(y = Hexp - 0.01, yend = Hobs, xend = Hobs), linetype = "dashed") +
geom_text(aes(label = loci), size = 3) +
geom_abline(slope = 1) +
scale_x_continuous(limits = range(c(heteroz$Hobs, heteroz$Hexp))) +
scale_y_continuous(limits = range(c(heteroz$Hobs, heteroz$Hexp))) +
labs(title = "Heterozygosity: expected vs observed") +
xlab(expression(bold("Observed heterozygosity"))) +
ylab(expression(bold("Expected heterozygosity"))) +
theme_classic() +
coord_fixed()
Now let’s consider observed and expected heterozygosity by population. Here, we will focus on the mean across loci.
hierfstat conveniently provides us with functions to calculate this:
Hs(myData)
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## 0.7225571 0.7179429 0.5372143 0.5762500 0.6160571 0.6196429
Ho(myData)
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## 0.6964286 0.6751714 0.5105429 0.5054143 0.5796286 0.5982143
Again, we can store the results for observed and expected heterozygosity into a data.frame and compute the difference:
heteroz_per_pop <- data.frame(Hexp = Hs(myData), Hobs = Ho(myData))
heteroz_per_pop$diff <- heteroz_per_pop$Hexp - heteroz_per_pop$Hobs
heteroz_per_pop
We have already tested whether these are significantly different above (while testing for HWE).
Geeky note:
To look at both the effect of the locus and the population at once, you can write some custom code:
heteroz_matrix <- do.call("cbind", lapply(seppop(myData),
function(x) summary(x)$Hexp - summary(x)$Hobs))
heteroz_matrix
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## INRA63 -0.00625 0.175555556 0.055555556 0.10000000 0.09814050 0.029296875
## INRA5 0.18500 -0.006666667 0.157777778 -0.04388889 0.16632231 -0.005859375
## ETH225 -0.08875 0.128888889 -0.180000000 -0.01111111 0.09637188 -0.078125000
## HEL5 0.14750 0.293697979 0.080000000 0.05111111 -0.03822314 -0.097656250
## HEL1 -0.08375 -0.031111111 -0.055555556 0.18668252 -0.06301653 0.015625000
## INRA35 -0.04000 0.132777778 0.191111111 0.05053508 0.06508264 0.134765625
## ETH152 0.07250 -0.002222222 0.221938776 -0.09833333 -0.01859504 0.121093750
## INRA23 0.07375 -0.058673469 -0.068888889 -0.02913199 0.12603306 0.214843750
## ETH10 -0.08875 -0.097777778 -0.068888889 0.24333333 0.13326446 -0.078125000
## HEL9 -0.05500 0.033888889 -0.006666667 0.22833333 -0.07541322 0.066406250
## CSSM66 -0.06375 -0.042777778 0.124444444 0.10555556 -0.03719008 -0.001953125
## INRA32 0.17750 0.026159334 -0.002222222 0.10611111 0.07954545 -0.169921875
## ETH3 -0.05625 -0.028333333 -0.197777778 -0.07277778 -0.08390023 -0.109375000
## BM2113 -0.07000 -0.103888889 -0.142222222 0.02333333 -0.14772727 -0.021484375
This can be easily plotted with lattice:
levelplot(heteroz_matrix, col.regions = viridis(100),
main = "Heterozygosity: expected-observed",
xlab = "Locus", ylab = "Population",
scales = list(x = list(rot = 90)))
One effect of population subdivision on genetic variation is a reduction in observed heterozygotes, as we have seen previously with the Wahlund effect.
The extent of reduction in observed heterozygotes can be used to quantify the level of differentiation between sub-populations.
This quantification is formalised in a series of hierarchical F-statistics.
In other words, we are using a measure of departure from HWE to quantify the extent of differentiation between populations.
F-statistics also provide a way to quantify the level of heterozygosity at the individual level, and if/how this departs from expectations of HWE in the population.
How are these linked?
F-statistics, from https://en.wikipedia.org/wiki/F-statistics
We will focus on the two most common F-statistics: \(F_{IS}\) and \(F_{ST}\).
\(F_{IS}\) is known as the inbreeding coefficient.
\(F_{ST}\) is known as the fixation index.
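In terms of heterozygosities, these two statistics are commonly defined (in the classical formulation; note that different estimators exist) as:

\(F_{IS} = \dfrac{H_S - H_O}{H_S}\)

\(F_{ST} = \dfrac{H_T - H_S}{H_T}\)

where \(H_O\) is the observed heterozygosity, \(H_S\) the expected heterozygosity within sub-populations, and \(H_T\) the expected heterozygosity in the total population.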
Let’s calculate \(F_{ST}\) for our previous example.
|      | AA | Aa | aa |
|------|----|----|----|
| Pop1 | 50 | 0  | 0  |
| Pop2 | 0  | 0  | 50 |
Here, \(p = 0.5\) and \(q = 0.5\).
Our expected heterozygosity for the total population (i.e., Pop1 and Pop2 together) is given by
\(H_T = 2pq = 2 \times 0.5 \times 0.5 = 0.5\)
As we have no variation within the sub-populations (i.e., within Pop1 or Pop2), \(H_S = 0\), because \(2pq = 2 \times 1 \times 0 = 0\).
Then
\(F_{ST} = (0.5 - 0) / 0.5 = 1.0\)
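The same hand calculation can be written out in R (toy genotype counts from the table above; this mirrors the arithmetic only, not the estimator used by hierfstat):

```r
## F_ST hand calculation for the two toy populations
pop1 <- c(AA = 50, Aa = 0, aa = 0)
pop2 <- c(AA = 0, Aa = 0, aa = 50)
n <- sum(pop1) + sum(pop2)                               # 100 individuals
p <- (2 * (pop1["AA"] + pop2["AA"]) + pop1["Aa"] + pop2["Aa"]) / (2 * n)
Ht <- 2 * p * (1 - p)                                    # total expected het: 0.5
Hs <- mean(c(2 * 1 * 0, 2 * 0 * 1))                      # within-pop expected het: 0
Fst <- (Ht - Hs) / Ht
unname(Fst)                                              # 1
```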
Note: this has been a very simple exploration of how F-statistics are calculated. The approach has been expanded upon, and not all software or R packages calculate F-statistics in the same way. For example, some account for factors such as how individuals disperse (island model vs stepping-stone model), the mutation process (infinite alleles model vs step-wise mutation model) and other biases (e.g. by taking sampling bias into account). It is thus important to report (and, if possible, understand) which method you use. Do indicate in your paper/thesis which functions you used and from which package, to avoid any ambiguity.
To cite a package:
citation(package = "hierfstat")
##
## To cite package 'hierfstat' in publications use:
##
## Jerome Goudet and Thibaut Jombart (2021). hierfstat: Estimation and Tests of Hierarchical F-Statistics. R package version 0.5-10.
## https://CRAN.R-project.org/package=hierfstat
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {hierfstat: Estimation and Tests of Hierarchical F-Statistics},
## author = {Jerome Goudet and Thibaut Jombart},
## year = {2021},
## note = {R package version 0.5-10},
## url = {https://CRAN.R-project.org/package=hierfstat},
## }
packageVersion("hierfstat") ## don't forget to cite the version number!
## [1] '0.5.10'
Now let’s get back to myData.
The package hierfstat provides a function that directly computes \(F_{IS}\) and \(F_{ST}\), as well as other statistics (see ?basic.stats for details).
basic.stats(myData)
## $perloc
## Ho Hs Ht Dst Htp Dstp Fst Fstp Fis Dest
## INRA63 0.5220 0.6146 0.7692 0.1546 0.8001 0.1856 0.2010 0.2319 0.1506 0.4814
## INRA5 0.4523 0.5432 0.5669 0.0237 0.5716 0.0285 0.0419 0.0498 0.1672 0.0623
## ETH225 0.7549 0.7510 0.8233 0.0723 0.8377 0.0867 0.0878 0.1035 -0.0051 0.3482
## HEL5 0.6528 0.7460 0.7974 0.0514 0.8076 0.0616 0.0644 0.0763 0.1249 0.2427
## HEL1 0.5188 0.5266 0.7290 0.2024 0.7695 0.2429 0.2776 0.3156 0.0149 0.5130
## INRA35 0.2909 0.3919 0.4139 0.0220 0.4183 0.0264 0.0531 0.0631 0.2579 0.0434
## ETH152 0.5127 0.5779 0.6597 0.0818 0.6761 0.0981 0.1239 0.1451 0.1129 0.2325
## INRA23 0.6185 0.6796 0.7552 0.0755 0.7703 0.0906 0.1000 0.1176 0.0900 0.2828
## ETH10 0.6078 0.6309 0.7417 0.1109 0.7639 0.1330 0.1495 0.1741 0.0365 0.3604
## HEL9 0.6384 0.6883 0.8173 0.1290 0.8431 0.1548 0.1578 0.1836 0.0724 0.4966
## CSSM66 0.6844 0.7166 0.7576 0.0409 0.7658 0.0491 0.0540 0.0642 0.0450 0.1734
## INRA32 0.4852 0.5357 0.6418 0.1061 0.6630 0.1273 0.1653 0.1921 0.0942 0.2743
## ETH3 0.7540 0.6773 0.7580 0.0807 0.7741 0.0969 0.1065 0.1251 -0.1133 0.3001
## BM2113 0.8266 0.7667 0.8668 0.1000 0.8868 0.1200 0.1154 0.1354 -0.0780 0.5146
##
## $overall
## Ho Hs Ht Dst Htp Dstp Fst Fstp Fis Dest
## 0.5942 0.6319 0.7213 0.0894 0.7391 0.1073 0.1239 0.1451 0.0596 0.2914
However, this is not exactly what we need, since the default display shows metrics computed over all populations.
For \(F_{IS}\) we want one value per population. We can derive that information from basic.stats() with a little trick: we extract the appropriate component from the output of basic.stats(), e.g. using $Fis:
basic.stats(myData)$Fis # Fis for every locus (rows) by population (columns)
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## INRA63 0.0142 0.2424 0.1189 0.2038 0.1564 0.1176
## INRA5 0.3391 0.0050 0.3143 -0.0870 0.3636 0.0217
## ETH225 -0.0912 0.1855 -0.2584 0.0022 0.1502 -0.0744
## HEL5 0.2096 0.4105 0.1515 0.0915 -0.0288 -0.1048
## HEL1 -0.0839 -0.0378 -0.1667 0.2950 -0.1595 0.0667
## INRA35 -0.0721 0.2260 0.3939 0.2843 0.3437 0.4444
## ETH152 0.1331 0.0132 0.5357 -0.1674 -0.0092 0.2473
## INRA23 0.1207 -0.0714 -0.1144 -0.0290 0.2100 0.3059
## ETH10 -0.0840 -0.1105 -0.1144 0.5229 0.2896 -0.0843
## HEL9 -0.0483 0.0592 0.0244 0.3436 -0.0785 0.2063
## CSSM66 -0.0611 -0.0350 0.3488 0.1594 -0.0307 0.0299
## INRA32 0.2525 0.0541 0.0000 0.2574 0.1853 -0.2342
## ETH3 -0.0556 -0.0253 -0.2366 -0.1415 -0.0843 -0.1392
## BM2113 -0.0643 -0.1138 -0.1629 0.0507 -0.1840 0.0051
Fis <- colMeans(basic.stats(myData)$Fis) # close but not equal to (Hs(myData) - Ho(myData))/Hs(myData)
Fis
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## 0.03633571 0.05729286 0.05957857 0.12756429 0.08027143 0.05771429
These per-population \(F_{IS}\) values correspond to the mean of the \(F_{IS}\) values over all loci.
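The code comment above ("close but not equal") is because a mean of per-locus ratios is not the same as the ratio of the means. A toy illustration with made-up heterozygosities for two loci:

```r
## Mean of per-locus Fis vs Fis computed from mean heterozygosities
Hs_loc <- c(0.8, 0.2)                          # per-locus expected het
Ho_loc <- c(0.6, 0.1)                          # per-locus observed het
mean((Hs_loc - Ho_loc) / Hs_loc)               # mean of ratios: 0.375
(mean(Hs_loc) - mean(Ho_loc)) / mean(Hs_loc)   # ratio of means: 0.3
```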
As with any estimate, \(F_{IS}\) is measured with some uncertainty. To represent this uncertainty, we can compute the 95\(\%\) confidence interval of the mean \(F_{IS}\) value using a simple bootstrap:
Fis_CI <- boot.ppfis(myData, nboot = 999)
Fis_CI
## $call
## boot.ppfis(dat = myData, nboot = 999)
##
## $fis.ci
## ll hl
## 1 -0.0341 0.1124
## 2 -0.0123 0.1486
## 3 -0.0782 0.1880
## 4 0.0276 0.2181
## 5 -0.0194 0.1467
## 6 -0.0492 0.1289
We can combine and format the information in a single table:
Fis_table <- data.frame(Fis = Fis, Fis_CI$fis.ci)
colnames(Fis_table)[2:3] <- c("lwr", "upr")
Fis_table$lwr[Fis_table$lwr < 0] <- 0 # a lower bound below 0 does not make sense here (bootstrap artifact)
Fis_table
Concerning \(F_{ST}\) values, we want values between pairs of populations. For this, we need to use another function from hierfstat called pairwise.WCfst:
pairFst <- pairwise.WCfst(genind2hierfstat(myData)) # does not work directly on genind input :-(
pairFst
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## Pop1 NA 0.05267027 0.14892780 0.11471852 0.07502668 0.1507617
## Pop2 0.05267027 NA 0.19146245 0.15721627 0.13674589 0.1981238
## Pop3 0.14892780 0.19146245 NA 0.08769554 0.06282488 0.2605006
## Pop4 0.11471852 0.15721627 0.08769554 NA 0.04867005 0.2296986
## Pop5 0.07502668 0.13674589 0.06282488 0.04867005 NA 0.1870359
## Pop6 0.15076166 0.19812376 0.26050061 0.22969861 0.18703587 NA
We can plot this output:
levelplot(pairFst, col.regions = rev(grey.colors(30)))
As for the \(F_{IS}\), these pairwise \(F_{ST}\) values correspond to the mean of \(F_{ST}\) values over all loci.
As for the \(F_{IS}\), we can compute the 95\(\%\) confidence interval of the mean \(F_{ST}\) value:
boot.ppfst(genind2hierfstat(myData), nboot = 999)
##
## Upper limit above diagonal
## Lower limit below diagonal
##
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## Pop1 NA 0.09260812 0.18991490 0.15980130 0.1048000 0.2434596
## Pop2 0.02367808 NA 0.26467184 0.19194497 0.2012672 0.2633425
## Pop3 0.10185057 0.12938895 NA 0.11331047 0.0956785 0.3496074
## Pop4 0.06966839 0.12306800 0.06286626 NA 0.0746029 0.3068141
## Pop5 0.04689766 0.08699458 0.03261141 0.02021752 NA 0.2805284
## Pop6 0.08241241 0.13506427 0.16959510 0.15537923 0.1037866 NA
It is worth checking if your populations have alleles that cannot be found in other populations. These are called “private alleles.”
We can use the private_alleles() function of poppr to get information about which alleles can only be found in a population.
Alleles are labelled as locus.allele.
private <- private_alleles(myData, report = "data.frame") ## report influences the output format
private
Let’s rework the output so that it actually reveals what the private alleles are for each population:
private_only <- private[private$count > 0, ]
private_only[order(private_only$population), ]
Or the same using the tidyverse package dplyr:
library(dplyr)
private |>
filter(count > 0) |>
arrange(population) |>
rename(nb_indiv = count)
We can count the number of private alleles in each population:
private_alleles_per_pop <- rowSums(private_alleles(myData) > 0)
private_alleles_per_pop
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6
## 5 6 2 3 5 9
Private alleles will impact measures of differentiation between populations. This is especially true if their frequency is high (i.e., if they are common).
Note that the number of private alleles may be influenced by many factors, including sample size.
What would you expect the relationship to be?
Let’s explore this using a plot:
ind_per_pop <- sapply(seppop(myData), nInd)
plot(ind_per_pop, private_alleles_per_pop,
xlab = "Sample size", ylab = "Number of private alleles", las = 1,
col = NULL)
text(y = private_alleles_per_pop,
x = ind_per_pop,
labels = names(ind_per_pop))
Individual-based analyses frequently do not make assumptions about HWE, because they look at the differences between individuals.
But consider how this kind of information must be presented: as an individual × individual matrix of pairwise differences.
It is not surprising that we have come up with ways to visualize this kind of information.
Two very common ways are trees (or networks) and PCA.
One benefit of displaying information in this manner is that we can see if there are groupings of individuals that are more similar to each other than they are to other individuals.
adegenet and poppr offer us the ability to generate other distance measures. Most of the ones offered in poppr work for individuals (rather than populations).
Note: several of these distance measures make strong assumptions about the biological nature of our genetic samples.
Let’s try another measure that makes no assumptions: Prevosti’s distance, which we can compute with prevosti.dist():
mynj_prevosti <- nj(prevosti.dist(seppop(myData)[["Pop1"]]))
plot(mynj_prevosti, type = "unrooted", cex = 0.8)
Note: we did not need to use 1 - in this case, because the function prevosti.dist() directly provides a distance.
Let us compare the two trees we just made to see if they match:
cophyloplot(mynj, mynj_prevosti,
assoc = cbind(mynj$tip.label, mynj_prevosti$tip.label), cex = 0.8)
Are you happy with that?
Now that we know how to make a NJ tree for a small dataset, we can do it for all samples in myData.
bignj <- nj(prevosti.dist(myData))
plot(bignj, type = "unrooted", cex = 0.8)
Geeky note: we can improve this plot by plotting colours and changing the type of tree. Here a possibility would be a “fan” plot:
plot(bignj, type = "fan", show.tip.label = FALSE, x.lim = c(-0.7, 0.7), no.margin = TRUE)
tiplabels(text = rownames(myData@tab),
frame = "none",
col = rainbow(nPop(myData))[as.numeric(myData@pop)], cex = 0.8, offset = 0.05)
legend("topleft", fill = rainbow(nPop(myData)),
legend = popNames(myData), bty = "n",
title = "Population")
Or perhaps a “radial” plot:
plot(bignj, type = "radial", show.tip.label = FALSE, x.lim = c(-0.7, 0.7), no.margin = TRUE)
tiplabels(text = rownames(myData@tab),
frame = "none",
col = rainbow(nPop(myData))[as.numeric(myData@pop)], cex = 0.8, offset = 0.05)
legend("topleft", fill = rainbow(nPop(myData)),
legend = popNames(myData), bty = "n",
title = "Population")
An alternative to neighbour joining is to reduce the dimensionality of the problem so that it can be plotted in 2 dimensions instead of one dimension per locus. Many options exist to do that, but the most common one is the so-called Principal Component Analysis, or PCA for short.
To draw the PCA very quickly, you can use hierfstat:
plot(indpca(myData))
To do a little more, we are going to use the combination of the packages ade4 and adegenet instead.
This is how you run a PCA:
myData_matrix <- scaleGen(myData, center = FALSE, scale = FALSE, NA.method = "mean")
mypca <- dudi.pca(myData_matrix, center = TRUE, scale = FALSE, scannf = FALSE, nf = Inf)
The PCA creates new dimensions…
head(mypca$li[, 1:4]) ## only show head for first 4 axes
which are uncorrelated…
head(zapsmall(cor(mypca$li))[, 1:4]) ## only show head for first 4 axes
## Axis1 Axis2 Axis3 Axis4
## Axis1 1 0 0 0
## Axis2 0 1 0 0
## Axis3 0 0 1 0
## Axis4 0 0 0 1
## Axis5 0 0 0 0
## Axis6 0 0 0 0
and which capture a decreasing amount of variation of the original loci:
barplot(mypca$eig,
names.arg = colnames(mypca$li),
cex.names = 0.5,
col = heat.colors(length(mypca$eig)),
las = 2, ylab = "Inertia")
or in cumulated percentage:
barplot(cumsum(100*mypca$eig/sum(mypca$eig)),
names.arg = colnames(mypca$li),
cex.names = 0.5,
col = rev(heat.colors(length(mypca$eig))),
las = 2, log = "y",
ylab = "Cumulative proportion of variance explained")
There are many ways to plot a PCA, but here we are interested in projecting the individuals into the new space defined by the loci, so we use the function s.class():
s.class(mypca$li, fac = pop(myData),
col = rainbow(nPop(myData)), grid = FALSE, xax = 1, yax = 2, cpoint = 0)
s.label(mypca$li, add.plot = TRUE, boxes = FALSE, clabel = 0.5)
add.scatter.eig(mypca$eig[1:10], xax = 1, yax = 2, ratio = 0.15)
Here is how each allele contributes to the axes, while displaying only the names of the private alleles for population 6:
s.arrow(mypca$co[, 1:2], boxes = FALSE, cpoint = 0, clabel = 0)
private_alleles_pop6 <- colnames(private_alleles(myData))[private_alleles(myData)["Pop6", ] != 0]
s.arrow(mypca$co[private_alleles_pop6, 1:2], boxes = FALSE, clabel = 0.8, add.plot = TRUE)
The PCA does not force the groups to be different; it just shows the overall structure. In contrast, the DAPC is a method (also implemented in adegenet) that allows us to explore whether some combinations of alleles would allow the characterisation of distinct groups.
This is particularly useful to perform assignments of samples from unknown origins to populations!
Unfortunately, certain steps cannot work with missing data. Because we want to perform all steps on the same data, we thus start by removing missing values from the data. The function missingno() from poppr does exactly that. Here we delete genotypes with missing data (but you have other options):
myData_noNA <- missingno(myData, type = "geno", cutoff = 0) # xvalDapc cannot handle missing values so we remove them
##
## Found 103 missing values.
##
## 9 genotypes contained missing values greater than 0%
##
## Removing 9 genotypes: AFBIZEB9455, AFBIZEB9468, AFBIZEB9480, AFBTLAG9403, AFBTND212, AFBTND223, AFBTND292, AFBTSOM9355, AFBTSOM9357
We can now start by identifying the genetic groups that best fit the data:
grp <- find.clusters(myData_noNA, n.pca = Inf, n.clust = 4) ## best not to define n.clust and choose interactively; keep n.pca = Inf here
## Warning: did not converge in 100000 iterations
Here only 4 groups were defined. Note that, unlike when using find.clusters(), we do need to limit n.pca here so as not to overfit.
We can now run the DAPC:
dapc1 <- dapc(myData_noNA, pop = grp$grp, n.pca = 10, n.da = Inf) ## for choice of n.pca see below (it is very important!)
scatter(dapc1)
Note that, for a relatively small number of clusters like here, n.da does not need to be anything other than Inf.
An alternative visual representation of the same thing, which superficially mimics the one produced by the famous program Structure, is obtained with compoplot():
compoplot(dapc1, show.lab = TRUE, legend = FALSE, cex.names = 0.4,
lab = paste(pop(myData_noNA), rownames(dapc1$tab)))
Let’s now assign the following individual:
newID <- data.frame(INRA63.183 = 2, INRA63.181 = 0, INRA63.177 = 0,
INRA63.175 = 0, INRA63.179 = 0, INRA63.185 = 0, INRA63.171 = 1,
INRA5.137 = 0, INRA5.141 = 2, INRA5.143 = 0, INRA5.139 = 0,
INRA5.149 = 0, ETH225.147 = 0, ETH225.157 = 1, ETH225.139 = 0,
ETH225.141 = 0, ETH225.153 = 0, ETH225.149 = 0, ETH225.155 = 0,
ETH225.143 = 0, ETH225.159 = 1, ETH225.151 = 0, ETH225.137 = 0,
ETH225.145 = 0, HE5.149 = 0, HE5.151 = 0, HE5.163 = 0,
HE5.165 = 0, HE5.167 = 2, HE5.155 = 0, HE5.157 = 0,
HE5.153 = 0, HE5.161 = 0, HE5.159 = 0, HE1.103 = 1,
HE1.109 = 0, HE1.105 = 1, HE1.107 = 0, HE1.101 = 0,
HE1.117 = 0, HE1.113 = 0, HE1.111 = 0, INRA35.102 = 0,
INRA35.104 = 2, INRA35.120 = 2, INRA35.110 = 0, INRA35.108 = 0,
INRA35.114 = 0, INRA35.106 = 0, ETH152.191 = 0, ETH152.195 = 1,
ETH152.197 = 1, ETH152.193 = 0, ETH152.201 = 0, ETH152.199 = 0,
ETH152.205 = 0, INRA23.213 = 0, INRA23.215 = 0, INRA23.197 = 0,
INRA23.199 = 1, INRA23.209 = 2, INRA23.205 = 0, INRA23.211 = 1,
INRA23.203 = 0, INRA23.207 = 0, INRA23.217 = 0, INRA23.201 = 0,
INRA23.193 = 0, ETH10.209 = 0, ETH10.211 = 0, ETH10.207 = 1,
ETH10.217 = 0, ETH10.219 = 1, ETH10.221 = 0, ETH10.215 = 0,
ETH10.223 = 0, ETH10.213 = 2, HE9.153 = 0, HE9.159 = 0,
HE9.165 = 1, HE9.155 = 0, HE9.161 = 1, HE9.163 = 0,
HE9.167 = 0, HE9.157 = 0, HE9.149 = 0, HE9.169 = 0,
HE9.151 = 0, CSSM66.181 = 0, CSSM66.189 = 1, CSSM66.185 = 1,
CSSM66.193 = 0, CSSM66.197 = 0, CSSM66.183 = 0, CSSM66.199 = 0,
CSSM66.187 = 0, CSSM66.179 = 0, CSSM66.195 = 0, CSSM66.171 = 0,
CSSM66.191 = 0, INRA32.162 = 1, INRA32.178 = 0, INRA32.180 = 0,
INRA32.184 = 0, INRA32.202 = 0, INRA32.182 = 1, INRA32.176 = 0,
INRA32.174 = 0, INRA32.164 = 0, INRA32.160 = 0, INRA32.204 = 0,
INRA32.168 = 0, ETH3.117 = 0, ETH3.125 = 0, ETH3.115 = 2,
ETH3.127 = 0, ETH3.103 = 0, ETH3.123 = 0, ETH3.129 = 0,
ETH3.131 = 0, ETH3.119 = 0, ETH3.109 = 0, ETH3.113 = 0,
BM2113.140 = 0, BM2113.142 = 0, BM2113.122 = 0, BM2113.134 = 0,
BM2113.136 = 1, BM2113.146 = 1, BM2113.138 = 0, BM2113.130 = 0,
BM2113.124 = 0, BM2113.150 = 0, BM2113.128 = 0, BM2113.132 = 0,
BM2113.126 = 0, BM2113.144 = 0)
predict(dapc1, newdata = newID)
## $assign
## [1] 4
## Levels: 1 2 3 4
##
## $posterior
## 1 2 3 4
## [1,] 7.49464e-26 0.09536122 0.000445176 0.9041936
##
## $ind.scores
## LD1 LD2 LD3
## [1,] 4.953179 -0.1024723 2.097223
So we can see that this individual is assigned to one of the four clusters (although it is not impossible that it belongs to another).
But what does it mean? To figure this out, we need to go back to the clustering we started with and compare it to our populations:
table.value(table(pop(myData_noNA), grp$grp), col.lab = paste0("cluster_", 1:4))
As you can see, this cluster corresponds to both Pop1 and Pop2, so our new individual may come from either of these two populations.
Note that, to obtain the best prediction accuracy, it is important to select the right number of PCA axes when running the DAPC. You can estimate this number by cross-validation using the function xvalDapc():
cross_validation <- xvalDapc(myData_noNA, grp = grp$grp, n.da = Inf, n.rep = 30) # increase n.rep for real usage!
cross_validation$`Number of PCs Achieving Highest Mean Success`
## [1] "10"
As you can see, 10 PCA axes lead to the best results, which is why we chose that number above.
For more details, see this tutorial.